This notebook presents a small study of the performance and usefulness of HumanFirst, an NLU data-labeling tool. 88,015 datapoints were uploaded to HumanFirst, and the tool was then used to label 1,782 of them. During this process, information such as the time taken, the number of datapoints handled, and the difficulty of each individual operation in HumanFirst was recorded in the hf-evaluation-data.csv file. We will take a deeper look at this data and build a neural network model to predict the time required to label thousands of datapoints with HumanFirst.
| Labeled | Total Datapoints |
|---|---|
| 1,782 | 88,015 |
- Note that this is just one approach, and the actual time for your cases might differ. In general, HumanFirst is a really good tool.
- The term *datapoint* here is used to mean the same thing as a text sentence.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline
# Load in the data
df = pd.read_csv('./data/hf-evaluation-data.csv')
num_of_valid_observations = len(df)
df = df.iloc[:num_of_valid_observations]
# Check info of the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 34 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   No                              42 non-null     int64
 1   ClusteringOn                    42 non-null     int64
 2   ClusterSize                     42 non-null     int64
 3   Granularity                     42 non-null     int64
 4   UseTrainedModel                 42 non-null     int64
 5   ExploreBySimilarity             42 non-null     int64
 6   Disambiguate                    42 non-null     int64
 7   TotalNumOfIntents               42 non-null     int64
 8   ClusterNumOfSampleSentences     42 non-null     int64
 9   SentencesGenerallyRelated       42 non-null     int64
 10  SentencesGenerallyUnrelated     42 non-null     int64
 11  SentencesTruelyRelated          42 non-null     int64
 12  SentencesOutOfTruelyRelated     42 non-null     int64
 13  SentencesOutCluster             42 non-null     int64
 14  SentencesDividedInClusters      42 non-null     int64
 15  HighestMatchScore_OtherIntent1  29 non-null     float64
 16  HighestMatchScore_OtherIntent2  20 non-null     float64
 17  NewCluster                      42 non-null     int64
 18  Intent                          42 non-null     object
 19  SubIntent                       8 non-null      object
 20  F1                              42 non-null     int64
 21  Precision                       42 non-null     int64
 22  Recall                          42 non-null     int64
 23  Accuracy                        42 non-null     int64
 24  AvgF1                           42 non-null     int64
 25  AvgPrecision                    42 non-null     int64
 26  AvgRecall                       42 non-null     int64
 27  AvgAccuracy                     42 non-null     int64
 28  Confidence                      42 non-null     int64
 29  Coverage                        42 non-null     int64
 30  SuccessfulClustering            42 non-null     int64
 31  OpHardness                      42 non-null     object
 32  Time (min)                      42 non-null     float64
 33  Note                            17 non-null     object
dtypes: float64(3), int64(27), object(4)
memory usage: 11.3+ KB
# Drop unuseful columns
hf_data = df.drop(
['No',
'ClusterSize',
'Granularity',
'HighestMatchScore_OtherIntent1',
'HighestMatchScore_OtherIntent2',
'Confidence',
'SentencesDividedInClusters',
'Note'
],
axis=1)
# Show the head of the dataframe
hf_data.head()
| | ClusteringOn | UseTrainedModel | ExploreBySimilarity | Disambiguate | TotalNumOfIntents | ClusterNumOfSampleSentences | SentencesGenerallyRelated | SentencesGenerallyUnrelated | SentencesTruelyRelated | SentencesOutOfTruelyRelated | ... | Recall | Accuracy | AvgF1 | AvgPrecision | AvgRecall | AvgAccuracy | Coverage | SuccessfulClustering | OpHardness | Time (min) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 20 | 13 | 13 | 0 | 10 | 3 | ... | 77 | 99 | 88 | 89 | 87 | 99 | 24 | 1 | Complicated | 10.0 |
| 1 | 1 | 0 | 0 | 0 | 21 | 12 | 11 | 1 | 8 | 2 | ... | 92 | 99 | 89 | 90 | 88 | 99 | 24 | 1 | Normal | 4.0 |
| 2 | 1 | 0 | 0 | 0 | 21 | 10 | 10 | 0 | 9 | 1 | ... | 96 | 96 | 89 | 90 | 89 | 99 | 31 | 1 | Easy | 2.0 |
| 3 | 1 | 0 | 0 | 0 | 21 | 20 | 19 | 1 | 17 | 1 | ... | 96 | 99 | 90 | 90 | 89 | 99 | 31 | 1 | Easy | 1.0 |
| 4 | 1 | 0 | 0 | 0 | 21 | 20 | 20 | 0 | 14 | 6 | ... | 95 | 95 | 85 | 85 | 85 | 99 | 28 | 1 | Easy | 2.0 |
5 rows × 26 columns
# Check descriptive statistics of the data
hf_data.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ClusteringOn | 42.0 | 0.547619 | 0.503761 | 0.0 | 0.00 | 1.0 | 1.0 | 1.0 |
| UseTrainedModel | 42.0 | 0.666667 | 0.477119 | 0.0 | 0.00 | 1.0 | 1.0 | 1.0 |
| ExploreBySimilarity | 42.0 | 0.285714 | 0.457230 | 0.0 | 0.00 | 0.0 | 1.0 | 1.0 |
| Disambiguate | 42.0 | 0.166667 | 0.377195 | 0.0 | 0.00 | 0.0 | 0.0 | 1.0 |
| TotalNumOfIntents | 42.0 | 26.071429 | 2.883027 | 20.0 | 23.50 | 28.0 | 28.0 | 31.0 |
| ClusterNumOfSampleSentences | 42.0 | 24.000000 | 27.178722 | 1.0 | 6.00 | 12.5 | 35.5 | 142.0 |
| SentencesGenerallyRelated | 42.0 | 23.952381 | 27.193931 | 1.0 | 6.00 | 12.5 | 35.5 | 142.0 |
| SentencesGenerallyUnrelated | 42.0 | 0.047619 | 0.215540 | 0.0 | 0.00 | 0.0 | 0.0 | 1.0 |
| SentencesTruelyRelated | 42.0 | 22.952381 | 26.889977 | 1.0 | 6.00 | 11.5 | 34.0 | 142.0 |
| SentencesOutOfTruelyRelated | 42.0 | 0.952381 | 1.695797 | 0.0 | 0.00 | 0.0 | 1.0 | 7.0 |
| SentencesOutCluster | 42.0 | 1.047619 | 1.780002 | 0.0 | 0.00 | 0.0 | 1.0 | 7.0 |
| NewCluster | 42.0 | 0.214286 | 0.415300 | 0.0 | 0.00 | 0.0 | 0.0 | 1.0 |
| F1 | 42.0 | 95.452381 | 4.109492 | 83.0 | 93.00 | 95.0 | 100.0 | 100.0 |
| Precision | 42.0 | 95.857143 | 4.291634 | 87.0 | 93.00 | 96.0 | 100.0 | 100.0 |
| Recall | 42.0 | 95.380952 | 5.323452 | 77.0 | 94.00 | 96.0 | 100.0 | 100.0 |
| Accuracy | 42.0 | 99.285714 | 1.348635 | 94.0 | 99.00 | 100.0 | 100.0 | 100.0 |
| AvgF1 | 42.0 | 88.761905 | 2.886550 | 84.0 | 86.00 | 88.5 | 91.0 | 93.0 |
| AvgPrecision | 42.0 | 89.428571 | 2.548485 | 85.0 | 87.25 | 89.5 | 92.0 | 94.0 |
| AvgRecall | 42.0 | 88.738095 | 3.004546 | 84.0 | 86.25 | 88.5 | 92.0 | 93.0 |
| AvgAccuracy | 42.0 | 99.404762 | 0.496796 | 99.0 | 99.00 | 99.0 | 100.0 | 100.0 |
| Coverage | 42.0 | 37.214286 | 7.872570 | 24.0 | 29.50 | 36.5 | 45.0 | 50.0 |
| SuccessfulClustering | 42.0 | 1.000000 | 0.000000 | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
| Time (min) | 42.0 | 3.130952 | 2.756499 | 0.5 | 1.00 | 2.0 | 5.0 | 11.0 |
hf_data['Functionality'] = (hf_data['ClusteringOn'] +
(hf_data['ExploreBySimilarity'] * 2) +
(hf_data['Disambiguate'] * 3)).map({
1: 'ClusteringSentences',
2: 'ExploreBySimilarity',
3: 'DisambiguateIntents'
})
func_df = pd.DataFrame({
'Count': hf_data['Functionality'].value_counts(),
'Percentage': hf_data['Functionality'].value_counts() / len(hf_data['Functionality']) * 100
})
px.pie(func_df, values='Count', names=func_df.index)
hf_data['SuccessfulClustering'].sum() / len(hf_data) * 100
100.0
The rate of successfully clustering data is `100%`, meaning that we are *successful* every time we perform an operational action, whether grouping data into different intents, finding similar data points, or disambiguating intents. This indicates that our operational actions in HumanFirst will most likely help us build and improve our `NLU` model every time.
ru_df = hf_data[[
'ClusterNumOfSampleSentences',
'SentencesGenerallyRelated',
'SentencesGenerallyUnrelated',
'SentencesTruelyRelated',
'SentencesOutOfTruelyRelated',
'SentencesOutCluster'
]]
ru_df.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ClusterNumOfSampleSentences | 42.0 | 24.000000 | 27.178722 | 1.0 | 6.0 | 12.5 | 35.5 | 142.0 |
| SentencesGenerallyRelated | 42.0 | 23.952381 | 27.193931 | 1.0 | 6.0 | 12.5 | 35.5 | 142.0 |
| SentencesGenerallyUnrelated | 42.0 | 0.047619 | 0.215540 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| SentencesTruelyRelated | 42.0 | 22.952381 | 26.889977 | 1.0 | 6.0 | 11.5 | 34.0 | 142.0 |
| SentencesOutOfTruelyRelated | 42.0 | 0.952381 | 1.695797 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 |
| SentencesOutCluster | 42.0 | 1.047619 | 1.780002 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 |
Looking at the above statistics, we can see that each operation suggests around 24 sample sentences on average. Among those 24 sentences, 23.95 are generally related to each other (have similar meanings) while 0.05 are unrelated. If we consider the meaning of those 23.95 generally related sentences more strictly, 22.95 are truly related to each other and can form a cluster (intent) of sentences, while only 0.95 fall outside the truly related group. Roughly speaking, on average, each time HumanFirst suggests 24 text sentences, 22 or 23 can safely be assigned to a cluster and 1 or 2 cannot. The number of suggested sentences can reach a maximum of 142 per operation (this number might differ based on how we use the tool), with the middle 50% of operations handling between 6 and 35.5 sentences and a median of 12.5.
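As a rough quality measure, the share of suggested sentences that end up truly related can be computed directly from the means above (a quick sketch; the two values are copied from the `describe()` table rather than recomputed from the raw data):

```python
# Average "precision" of HumanFirst's suggestions, using the means
# reported in the describe() table above
avg_suggested = 24.0           # mean of ClusterNumOfSampleSentences
avg_truly_related = 22.952381  # mean of SentencesTruelyRelated
precision = avg_truly_related / avg_suggested * 100
print(f'{precision:.1f}% of suggested sentences are truly related')  # 95.6%
```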
px.bar(
ru_df,
barmode='group',
labels=dict(value='Number of Sentences', index='Sampling Session ID', variable='Related/Unrelated Categories'),
)
f1a_df = hf_data[[
'F1',
'Precision',
'Recall',
'Accuracy',
'AvgF1',
'AvgPrecision',
'AvgRecall',
'AvgAccuracy'
]]
f1a_df.describe()
| | F1 | Precision | Recall | Accuracy | AvgF1 | AvgPrecision | AvgRecall | AvgAccuracy |
|---|---|---|---|---|---|---|---|---|
| count | 42.000000 | 42.000000 | 42.000000 | 42.000000 | 42.000000 | 42.000000 | 42.000000 | 42.000000 |
| mean | 95.452381 | 95.857143 | 95.380952 | 99.285714 | 88.761905 | 89.428571 | 88.738095 | 99.404762 |
| std | 4.109492 | 4.291634 | 5.323452 | 1.348635 | 2.886550 | 2.548485 | 3.004546 | 0.496796 |
| min | 83.000000 | 87.000000 | 77.000000 | 94.000000 | 84.000000 | 85.000000 | 84.000000 | 99.000000 |
| 25% | 93.000000 | 93.000000 | 94.000000 | 99.000000 | 86.000000 | 87.250000 | 86.250000 | 99.000000 |
| 50% | 95.000000 | 96.000000 | 96.000000 | 100.000000 | 88.500000 | 89.500000 | 88.500000 | 99.000000 |
| 75% | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 91.000000 | 92.000000 | 92.000000 | 100.000000 |
| max | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 93.000000 | 94.000000 | 93.000000 | 100.000000 |
px.histogram(
f1a_df,
x=['F1', 'Accuracy', 'AvgF1', 'AvgAccuracy'],
barmode='group',
labels=dict(value='F1/Accuracy', variable='NLU Evaluation Metrics (%)'),
)
# Visualize the ratio between different levels of operation hardness
op_df = pd.DataFrame({
'Count': hf_data['OpHardness'].value_counts()
})
px.pie(
op_df,
values='Count', names=op_df.index,
color_discrete_sequence=px.colors.sequential.Aggrnyl
)
# Visualize the relationship between 'operation hardness' and 'number of sample sentences labeled by different operations'
px.sunburst(
hf_data,
path=['OpHardness', 'Functionality'],
values='ClusterNumOfSampleSentences',
height=550
)
hf_data[[
'Time (min)',
'ClusterNumOfSampleSentences',
'SentencesGenerallyRelated',
'SentencesGenerallyUnrelated',
'SentencesTruelyRelated',
'SentencesOutOfTruelyRelated',
'SentencesOutCluster',
]].describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Time (min) | 42.0 | 3.130952 | 2.756499 | 0.5 | 1.0 | 2.0 | 5.0 | 11.0 |
| ClusterNumOfSampleSentences | 42.0 | 24.000000 | 27.178722 | 1.0 | 6.0 | 12.5 | 35.5 | 142.0 |
| SentencesGenerallyRelated | 42.0 | 23.952381 | 27.193931 | 1.0 | 6.0 | 12.5 | 35.5 | 142.0 |
| SentencesGenerallyUnrelated | 42.0 | 0.047619 | 0.215540 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| SentencesTruelyRelated | 42.0 | 22.952381 | 26.889977 | 1.0 | 6.0 | 11.5 | 34.0 | 142.0 |
| SentencesOutOfTruelyRelated | 42.0 | 0.952381 | 1.695797 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 |
| SentencesOutCluster | 42.0 | 1.047619 | 1.780002 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 |
# Visualize the time required for each operation hardness level
px.box(
hf_data, x='OpHardness',
y='Time (min)',
color='OpHardness',
points='all',
height=500,
)
# Visualize the relationship between time, operation hardness and the number of sample sentences that can be clustered in a single operation
px.scatter(
hf_data, x='Time (min)',
y='ClusterNumOfSampleSentences',
color='OpHardness',
size='SentencesTruelyRelated',
height=500
)
time_df = pd.DataFrame({
'Labeled': [1782],
'Total Datapoints': [88015],
'Total Messages': [262966 + 9886],
'Coverage (%)': [50]
})
time_df['Labeled Ratio (%)'] = time_df['Labeled'] / time_df['Total Datapoints'] * 100
time_df['Rate of User Msgs (%)'] = time_df['Total Datapoints'] / time_df['Total Messages'] * 100
time_df['Rate of Operator Msgs (%)'] = 100 - time_df['Rate of User Msgs (%)']
time_df
| | Labeled | Total Datapoints | Total Messages | Coverage (%) | Labeled Ratio (%) | Rate of User Msgs (%) | Rate of Operator Msgs (%) |
|---|---|---|---|---|---|---|---|
| 0 | 1782 | 88015 | 272852 | 50 | 2.024655 | 32.257414 | 67.742586 |
hf_data[[
'Time (min)',
'ClusterNumOfSampleSentences',
'SentencesGenerallyRelated',
'SentencesGenerallyUnrelated',
'SentencesTruelyRelated',
'SentencesOutOfTruelyRelated',
'SentencesOutCluster',
]].describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Time (min) | 42.0 | 3.130952 | 2.756499 | 0.5 | 1.0 | 2.0 | 5.0 | 11.0 |
| ClusterNumOfSampleSentences | 42.0 | 24.000000 | 27.178722 | 1.0 | 6.0 | 12.5 | 35.5 | 142.0 |
| SentencesGenerallyRelated | 42.0 | 23.952381 | 27.193931 | 1.0 | 6.0 | 12.5 | 35.5 | 142.0 |
| SentencesGenerallyUnrelated | 42.0 | 0.047619 | 0.215540 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| SentencesTruelyRelated | 42.0 | 22.952381 | 26.889977 | 1.0 | 6.0 | 11.5 | 34.0 | 142.0 |
| SentencesOutOfTruelyRelated | 42.0 | 0.952381 | 1.695797 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 |
| SentencesOutCluster | 42.0 | 1.047619 | 1.780002 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 |
avg_time_per_operation = hf_data.describe().transpose()['mean'].loc['Time (min)']
avg_time_per_operation
3.130952380952381
avg_sentences_clustered_per_operation = hf_data.describe().transpose()['mean'].loc['SentencesTruelyRelated']
avg_sentences_clustered_per_operation
22.952380952380953
time_df['Total Labeling Time (min)'] = time_df['Labeled'] / avg_sentences_clustered_per_operation * avg_time_per_operation
time_df['Total Labeling Time (hour/min)'] = pd.to_datetime(
time_df['Total Labeling Time (min)'],
unit='m'
).dt.strftime('%H:%M')
time_df.transpose()
| | 0 |
|---|---|
| Labeled | 1782 |
| Total Datapoints | 88015 |
| Total Messages | 272852 |
| Coverage (%) | 50 |
| Labeled Ratio (%) | 2.024655 |
| Rate of User Msgs (%) | 32.257414 |
| Rate of Operator Msgs (%) | 67.742586 |
| Total Labeling Time (min) | 243.084025 |
| Total Labeling Time (hour/min) | 04:03 |
We need around `4` hours to label `1,782` data points, about `2%` of the total `88,015` data points, to build a new NLU model covering `50%` of the whole dataset.
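The `4`-hour figure follows from simple proportionality. A quick sketch verifying the arithmetic, with the two per-operation means copied from the `describe()` output above:

```python
# Naive estimate: (sentences to label / sentences per operation) * minutes per operation
avg_time_per_op = 3.130952380952381        # mean of 'Time (min)'
avg_sentences_per_op = 22.952380952380953  # mean of 'SentencesTruelyRelated'
total_minutes = 1782 / avg_sentences_per_op * avg_time_per_op
print(round(total_minutes))  # 243 minutes, i.e. about 4 hours
```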
sample_time_df = time_df[[
'Labeled',
'Total Datapoints',
'Total Labeling Time (min)',
]]
sample_time_df
| | Labeled | Total Datapoints | Total Labeling Time (min) |
|---|---|---|---|
| 0 | 1782 | 88015 | 243.084025 |
time_1000 = sample_time_df.iloc[0] * 1000 / sample_time_df.iloc[0]['Labeled']
time_2000 = sample_time_df.iloc[0] * 2000 / sample_time_df.iloc[0]['Labeled']
time_3000 = sample_time_df.iloc[0] * 3000 / sample_time_df.iloc[0]['Labeled']
time_5000 = sample_time_df.iloc[0] * 5000 / sample_time_df.iloc[0]['Labeled']
time_10000 = sample_time_df.iloc[0] * 10000 / sample_time_df.iloc[0]['Labeled']
prediction_time_df = pd.concat(
[sample_time_df.transpose(), time_1000, time_2000, time_3000, time_5000, time_10000],
axis=1
).transpose()
prediction_time_df['Labeled'] = prediction_time_df['Labeled'].map(round)
prediction_time_df['Total Datapoints'] = prediction_time_df['Total Datapoints'].map(round)
prediction_time_df['Total Labeling Time (hour/min)'] = pd.to_datetime(
prediction_time_df['Total Labeling Time (min)'],
unit='m'
).dt.strftime('%H:%M')
prediction_time_df
| | Labeled | Total Datapoints | Total Labeling Time (min) | Total Labeling Time (hour/min) |
|---|---|---|---|---|
| 0 | 1782 | 88015 | 243.084025 | 04:03 |
| 0 | 1000 | 49391 | 136.410788 | 02:16 |
| 0 | 2000 | 98782 | 272.821577 | 04:32 |
| 0 | 3000 | 148173 | 409.232365 | 06:49 |
| 0 | 5000 | 246956 | 682.053942 | 11:22 |
| 0 | 10000 | 493911 | 1364.107884 | 22:44 |
Let's first check the correlation between data features.
# Visualize correlation between data features
px.imshow(
hf_data.corr(),
height=700,
text_auto='.2f',
labels=dict(color='Correlation'),
color_continuous_scale=['#FBE8CB', '#F79D13']
)
# Remove redundant data features that are not really useful for our neural network
hf_data = hf_data.drop(
[
'TotalNumOfIntents',
'SentencesGenerallyUnrelated',
'SentencesOutOfTruelyRelated',
'SentencesOutCluster',
'Intent',
'SubIntent',
'F1',
'Precision',
'Recall',
'Accuracy',
'AvgF1',
'AvgPrecision',
'AvgRecall',
'AvgAccuracy',
'Coverage',
'SuccessfulClustering',
'Functionality',
],
axis=1
)
px.imshow(
hf_data.corr(),
height=500,
text_auto='.2f',
labels=dict(color='Correlation'),
color_continuous_scale=['#FBE8CB', '#F79D13']
)
# Show correlation between 'Time' and other data features
px.bar(
hf_data.corr()['Time (min)'].sort_values().drop('Time (min)'),
labels=dict(value='Correlation', index='')
)
Note that the `OpHardness` column contains categorical values. We need to transform them into numerical values so that machine learning algorithms can work with the data. Let's use pandas' `get_dummies()` method to do that.
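A tiny illustration (toy values, not the notebook's data) of how `get_dummies()` with `drop_first=True` turns the three hardness levels into two binary columns; the first category in alphabetical order (`Complicated`) becomes the implicit baseline:

```python
import pandas as pd

# Three categorical levels -> two dummy columns; a row with both dummies
# equal to 0 means 'Complicated'
demo = pd.DataFrame({'OpHardness': ['Easy', 'Normal', 'Complicated']})
dummies = pd.get_dummies(demo, columns=['OpHardness'], drop_first=True)
print(dummies.columns.tolist())  # ['OpHardness_Easy', 'OpHardness_Normal']
```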
# Convert categorical data feature 'OpHardness' to numerical dummy variables
hf_final_data = pd.get_dummies(hf_data, columns=['OpHardness'], drop_first=True)
from sklearn.model_selection import train_test_split
X = hf_final_data.drop('Time (min)', axis=1).values
y = hf_final_data['Time (min)'].values
# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
We will scale the data to a normalized form in which values stay between 0 and 1. This is good practice and improves the performance of the machine learning algorithms that will train our artificial neural network later.
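The transformation `MinMaxScaler` applies is simply `x' = (x - x_min) / (x_max - x_min)`, with the minimum and maximum taken from the training data only. A small numpy sketch of that formula on toy values:

```python
import numpy as np

# Min-max scaling: x' = (x - x_min) / (x_max - x_min),
# where x_min and x_max come from the training data only
train = np.array([1.0, 11.0, 21.0])
x_min, x_max = train.min(), train.max()
scaled_train = (train - x_min) / (x_max - x_min)
print(scaled_train)                     # [0.  0.5 1. ]
# A new (test) value is scaled with the same training min/max:
print((6.0 - x_min) / (x_max - x_min))  # 0.25
```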
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Let's build a neural network that can learn from our HumanFirst data and predict the time required for each operation.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
# Check number of input data features
num_of_data_features = X.shape[1]
num_of_data_features
10
# Define a feed-forward neural network model
model = Sequential()
model.add(Dense(units=num_of_data_features, activation='relu'))
# The Dropout layer randomly sets input units to 0 with a frequency of 0.5
# at each step during training time, which helps prevent overfitting.
# Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum
# over all inputs is unchanged.
model.add(Dropout(0.5))
model.add(Dense(units=num_of_data_features, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=num_of_data_features, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=1))
model.compile(loss='mse', optimizer='adam')
# Define a callback that signals the training process to stop early
# once the model is well trained. This helps avoid overfitting the model.
early_stop = EarlyStopping(
monitor='val_loss',
mode='min',
verbose=1,
patience=25
)
model.fit(
x=X_train,
y=y_train,
epochs=400,
# validation_data is the data on which to evaluate
# the loss and any model metrics at the end of each epoch.
# The model will not be trained on this data.
validation_data=(X_test, y_test),
verbose=0,
callbacks=[early_stop]
)
Epoch 00284: early stopping
<tensorflow.python.keras.callbacks.History at 0x24c09affe80>
model_loss = pd.DataFrame(model.history.history)
model_loss
| | loss | val_loss |
|---|---|---|
| 0 | 21.830187 | 13.024487 |
| 1 | 20.522224 | 12.958340 |
| 2 | 20.104691 | 12.895252 |
| 3 | 19.172279 | 12.833569 |
| 4 | 19.324076 | 12.775092 |
| ... | ... | ... |
| 279 | 8.958416 | 4.444560 |
| 280 | 12.800607 | 4.455398 |
| 281 | 9.250002 | 4.454210 |
| 282 | 7.513301 | 4.445971 |
| 283 | 6.086430 | 4.429697 |
284 rows × 2 columns
px.line(
data_frame=model_loss,
x=range(len(model_loss)),
y=model_loss.columns,
labels=dict(x='Training Epoch'),
title='Losses of Neural Network Model over Training and Validation Process'
)
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$ MAE(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $$
Mean Squared Error (MSE) is the most commonly used loss function for regression. It is the mean, over the observed data, of the squared differences between true and predicted values:
$$ MSE(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$ RMSE(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
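A quick numpy check of the three formulas on toy values (not the notebook's predictions):

```python
import numpy as np

# Toy true values and predictions
y = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.5, 5.0, 4.0])
mae = np.mean(np.abs(y - y_hat))   # (0.5 + 0.0 + 2.0) / 3
mse = np.mean((y - y_hat) ** 2)    # (0.25 + 0.0 + 4.0) / 3
rmse = np.sqrt(mse)
print(round(mae, 4), round(mse, 4), round(rmse, 4))  # 0.8333 1.4167 1.1902
```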
from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score
# Predict 'Time' for the test data set
predictions = model.predict(X_test)
predictions
array([[1.4933385],
[2.2544312],
[2.1273208],
[1.8432616],
[2.015727 ],
[1.2383702],
[1.4626192],
[1.7787102],
[1.8080543]], dtype=float32)
# Calculate errors
errors_df = pd.DataFrame({
'Mean Absolute Error': [mean_absolute_error(y_test, predictions)],
'Mean Squared Error (MSE)': [mean_squared_error(y_test, predictions)],
'Root Mean Squared Error': [np.sqrt(mean_squared_error(y_test, predictions))],
})
errors_df
| | Mean Absolute Error | Mean Squared Error (MSE) | Root Mean Squared Error |
|---|---|---|---|
| 0 | 1.686975 | 4.429697 | 2.104685 |
hf_data['Time (min)'].describe()
count    42.000000
mean      3.130952
std       2.756499
min       0.500000
25%       1.000000
50%       2.000000
75%       5.000000
max      11.000000
Name: Time (min), dtype: float64
pred_df = pd.DataFrame({
'Test Y': y_test,
'Model Predictions': pd.Series(predictions.reshape(len(y_test),))
})
# Calculate errors (differences between true values and predicted values)
pred_df['Error'] = pred_df['Test Y'] - pred_df['Model Predictions']
pred_df
| | Test Y | Model Predictions | Error |
|---|---|---|---|
| 0 | 4.0 | 1.493338 | 2.506662 |
| 1 | 6.0 | 2.254431 | 3.745569 |
| 2 | 5.0 | 2.127321 | 2.872679 |
| 3 | 1.0 | 1.843262 | -0.843262 |
| 4 | 5.0 | 2.015727 | 2.984273 |
| 5 | 1.0 | 1.238370 | -0.238370 |
| 6 | 0.5 | 1.462619 | -0.962619 |
| 7 | 2.0 | 1.778710 | 0.221290 |
| 8 | 1.0 | 1.808054 | -0.808054 |
px.scatter(
data_frame=pred_df,
x='Model Predictions',
y='Test Y',
# trendline='ols',
# trendline_color_override = 'red',
width=550,
height=450,
labels={'Test Y':'True Value'}
)
# Visualize the distribution of the error
sns.displot(pred_df['Error'], kind='kde');
# Select a data point (here index 6) representing a single HumanFirst operation
random_index = 6
single_operation = hf_final_data.drop('Time (min)', axis=1).iloc[random_index]
single_operation
ClusteringOn                    1
UseTrainedModel                 0
ExploreBySimilarity             0
Disambiguate                    0
ClusterNumOfSampleSentences    11
SentencesGenerallyRelated      11
SentencesTruelyRelated         11
NewCluster                      1
OpHardness_Easy                 1
OpHardness_Normal               0
Name: 6, dtype: int64
type(single_operation.values.reshape(-1, len(single_operation))), len(single_operation)
(numpy.ndarray, 10)
single_operation.values.reshape(-1, len(single_operation))
array([[ 1, 0, 0, 0, 11, 11, 11, 1, 1, 0]], dtype=int64)
single_operation_transformed = scaler.transform(single_operation.values.reshape(-1, len(single_operation)))
single_operation_transformed
array([[1. , 0. , 0. , 0. , 0.07092199,
0.07092199, 0.07092199, 1. , 1. , 0. ]])
model.predict(single_operation_transformed)
array([[1.6101286]], dtype=float32)
# Check original real 'Time' value
hf_final_data.iloc[random_index]
ClusteringOn                    1.0
UseTrainedModel                 0.0
ExploreBySimilarity            0.0
Disambiguate                    0.0
ClusterNumOfSampleSentences    11.0
SentencesGenerallyRelated      11.0
SentencesTruelyRelated         11.0
NewCluster                      1.0
Time (min)                      1.0
OpHardness_Easy                 1.0
OpHardness_Normal               0.0
Name: 6, dtype: float64
def simulate_and_calculate_time(total_number_of_labeled_sentences: int):
    """
    Simulate the whole process of labeling a set of sample sentences using HumanFirst.

    :param total_number_of_labeled_sentences: the total number of sample sentences to be labeled/clustered.
    :return: a list of the form `[a, b]` where `a` is the total number of actually labeled sentences
             and `b` is the total time required to label them.
    """
    # Calculate the average number of sample sentences that can be labeled by a single operation
    avg_sample_sentences_per_operation = hf_final_data['SentencesTruelyRelated'].mean()
    # Find the size of the random dataset simulating all operations required to label 'total_number_of_labeled_sentences' sentences
    random_dataset_size = round(total_number_of_labeled_sentences / avg_sample_sentences_per_operation)
    # Generate a random dataset
    random_indexes = np.random.randint(len(hf_final_data), size=random_dataset_size).tolist()
    random_dataset = hf_final_data.iloc[random_indexes].drop('Time (min)', axis=1)
    # Transform (scale) the random dataset
    random_dataset_transformed = scaler.transform(random_dataset)
    # Use the trained neural network model to predict the time for each data point in the random dataset
    predictions_for_random_dataset = model.predict(random_dataset_transformed)
    # Return the total number of actually labeled sentences and the total time required to label them
    return [random_dataset['SentencesTruelyRelated'].sum(), predictions_for_random_dataset.sum()]
simulate_and_calculate_time(1000)
[858, 75.27155]
simulate_and_calculate_time(2000)
[1693, 147.76727]
simulate_and_calculate_time(3000)
[3505, 230.25568]
simulate_and_calculate_time(5000)
[4764, 380.19086]
simulate_and_calculate_time(10000)
[10412, 772.2549]
# Apply the simulation function to calculate the time required and the actual number of data points labeled
# for some specific cases
# Define a dataframe with the expected number of data points to be labeled
prediction_time_using_neural_network_df = pd.DataFrame({
'Expect to Label': [1782, 1000, 2000, 3000, 5000, 10000],
})
# Call the simulation function, that in turn will use the trained neural network model to predict the time
neural_network_prediction_result = prediction_time_using_neural_network_df['Expect to Label']\
.apply(simulate_and_calculate_time)\
.apply(lambda x: pd.Series(x))
neural_network_prediction_result.columns = ['Actually Labeled', 'Total Labeling Time (min)']
neural_network_prediction_result['Actually Labeled'] = neural_network_prediction_result['Actually Labeled'].map(round)
# Combine the result to the original prediction_time_using_neural_network_df dataframe
prediction_time_using_neural_network_df = pd.concat([prediction_time_using_neural_network_df, neural_network_prediction_result], axis=1)
# Convert the time in minutes format to hours and minutes
prediction_time_using_neural_network_df['Total Labeling Time (hour/min)'] = pd.to_datetime(
prediction_time_using_neural_network_df['Total Labeling Time (min)'],
unit='m'
).dt.strftime('%H:%M')
prediction_time_using_neural_network_df
| | Expect to Label | Actually Labeled | Total Labeling Time (min) | Total Labeling Time (hour/min) |
|---|---|---|---|---|
| 0 | 1782 | 2108 | 143.985016 | 02:23 |
| 1 | 1000 | 988 | 78.904427 | 01:18 |
| 2 | 2000 | 1983 | 152.544937 | 02:32 |
| 3 | 3000 | 3237 | 235.228729 | 03:55 |
| 4 | 5000 | 5671 | 400.612946 | 06:40 |
| 5 | 10000 | 10299 | 776.086487 | 12:56 |
# Compare with the naive method of calculating time that we estimated previously
prediction_time_df
| | Labeled | Total Datapoints | Total Labeling Time (min) | Total Labeling Time (hour/min) |
|---|---|---|---|---|
| 0 | 1782 | 88015 | 243.084025 | 04:03 |
| 0 | 1000 | 49391 | 136.410788 | 02:16 |
| 0 | 2000 | 98782 | 272.821577 | 04:32 |
| 0 | 3000 | 148173 | 409.232365 | 06:49 |
| 0 | 5000 | 246956 | 682.053942 | 11:22 |
| 0 | 10000 | 493911 | 1364.107884 | 22:44 |